
Bioinformatics Advances

Oxford University Press (OUP)

Preprints posted in the last 30 days, ranked by how well they match the content profile of Bioinformatics Advances, based on 184 papers previously published here. The average preprint has a 0.16% match score for this journal, so anything above that is already an above-average fit.
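The ranking described above (scoring each preprint against a profile built from the journal's previously published papers) could in principle be computed with a simple bag-of-words cosine similarity. This is a minimal, purely illustrative sketch; the site's actual scoring model is not disclosed, and all function names and the toy corpus here are hypothetical.

```python
# Hypothetical sketch: rank preprints by cosine similarity between each
# abstract and a pooled "journal profile" built from prior papers.
# The real site's model is unknown; this only illustrates the idea.
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words term-frequency vector for a lowercased text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse Counter vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_preprints(journal_papers, preprints):
    """Score each (title, abstract) pair against the pooled journal profile
    and return them sorted by descending match score."""
    profile = vectorize(" ".join(journal_papers))
    scored = [(title, cosine(profile, vectorize(abstract)))
              for title, abstract in preprints]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy data: two prior journal papers, two candidate preprints.
journal = ["graph learning for gene perturbation phenotypes",
           "protein function prediction from multi omics data"]
candidates = [("A", "graph based learning of gene perturbation effects"),
              ("B", "clinical trial of a new antihypertensive drug")]
ranking = rank_preprints(journal, candidates)
```

On this toy input, the gene-perturbation preprint ranks above the unrelated clinical one, mirroring how above-profile matches float to the top of the list below.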

1
A graph-based learning approach to predict the effects of gene perturbations on molecular phenotypes

Jin, Y.; Sverchkov, Y.; Sushkova, A.; Ohtake, M.; Emfinger, C.; Craven, M.

2026-03-23 systems biology 10.64898/2026.03.20.712202 medRxiv
Top 0.1%
19.0%

Motivation: Large-scale gene knockdown/knockout screens have been used to gain insight into a wide array of phenotypes and biological processes. However, conducting such experiments is expensive and labor-intensive. In this work, we present a general graph-based machine-learning approach that can predict the effects of gene perturbations on molecular phenotypes of interest given some measured phenotypic effects of other gene perturbations. The motivation for learning models that can predict the effects of gene perturbations is fourfold. Such models can (1) predict effects for unmeasured genes in cases in which cost or technical barriers preclude perturbing every gene, (2) prioritize unmeasured genes or sets of genes for subsequent perturbation experiments, (3) hypothesize mechanisms that underlie the relationships between the perturbed genes and their effects, and (4) generalize to other unmeasured phenotypes of interest. Results: We evaluate our approach by applying it, in conjunction with four different learning methods, to learn models for four varied phenotypes. Our empirical evaluation demonstrates that the learned models (1) show relatively high levels of predictive accuracy across the four phenotypes, (2) have better predictive accuracy than several standard baselines, (3) can often learn accurate models with small training sets, (4) benefit from having multiple sources of evidence in the input representation, and (5) can, in many cases, transfer their predictive value to other phenotypes. Availability and Implementation: The assembled datasets and source code for this work are available at: https://github.com/Craven-Biostat-Lab/graph-molecular-phenotype-prediction

2
VaLPAS: Leveraging variation in experimental multi-omics data to elucidate protein function

Mahlich, Y.; Ross, D. H.; Monteiro, L.; McDermott, J. E.

2026-03-30 bioinformatics 10.64898/2026.03.26.712966 medRxiv
Top 0.1%
18.8%

Motivation: Despite continuing advances in sequencing and computational function determination, large parts of the studied gene, protein, and metabolite space remain functionally undetermined. Most function assignment is driven by homology searches and annotation transfer from known and extensively studied proteins but often fails to leverage available experimental omics data generated via technologies like mass spectrometry. Results: The VaLPAS (Variation-Leveraged Phenomic Association Screen) framework is available as a Python package and provides a user-friendly platform for calculating associations between expression patterns of genes or proteins in multi-omic datasets based on various statistical and learning methods. The goal of this approach is to shed light on the functional dark matter of protein space by elucidating previously unknown functions of molecules using guilt by association with molecules of known function. We present results demonstrating the utility of VaLPAS in identifying high-confidence predictions for a subset of genes/proteins of unknown function in a previously published multi-omics dataset from the oleaginous yeast Rhodotorula toruloides. Availability: VaLPAS is written in Python. The code is hosted on GitHub (https://github.com/PNNL-Predictive-Phenomics/valpas/).

3
STRmie-HD enables interruption-aware HTT repeat genotyping and somatic mosaicism profiling across sequencing platforms

Napoli, A.; Liorni, N.; Biagini, T.; Giovannetti, A.; Squitieri, A.; Miele, L.; Urbani, A.; Caputo, V.; Gasbarrini, A.; Squitieri, F.; Mazza, T.

2026-03-25 bioinformatics 10.64898/2026.03.21.713334 medRxiv
Top 0.1%
18.1%

Short tandem repeat expansions in exon 1 of the HTT gene drive Huntington's disease (HD) pathogenesis, with disease onset and progression heavily influenced by somatic mosaicism and sequence interruptions. While sequencing technologies enable repeat sizing, many computational tools lack the resolution to capture subtle interruption motifs and allele-specific somatic variation. We present STRmie-HD, an alignment-free, de novo framework for interruption-aware genotyping and quantitative profiling of somatic mosaicism at single-read resolution. The tool parses individual reads to quantify uninterrupted CAG tract length, CCG repeat content, and critical interruption variants, including Loss of Interruption (LOI) and Duplication of Interruption (DOI). Validated across Illumina, PacBio SMRT, and Oxford Nanopore platforms, STRmie-HD demonstrates high concordance with reference genotypes and superior sensitivity in identifying rare interruption patterns that conventional tools often overlook. Furthermore, it implements somatic mosaicism metrics to characterize repeat dynamics, successfully distinguishing the higher somatic expansion burden in brain tissues compared to peripheral blood. STRmie-HD offers a comprehensive and extensible solution for high-resolution molecular characterization of HTT variation, providing a robust framework for patient stratification and genetic research in HD.
Graphical Abstract: STRmie-HD flowchart. STRmie-HD is a comprehensive analytical framework that processes sequencing reads to analyze CAG/CCG trinucleotide repeats, interruption variants, and somatic mosaicism in the HTT gene. The workflow begins with sequencing reads (FASTA/FASTQ) that can undergo optional custom processing based on the sequencing design. These reads are then fed into a regular expression-based engine (STRmie-HD) to identify CAG and CCG motifs. The identified motifs lead to the estimation of CAG/CCG alleles, visualized as distinct peaks representing different allele sizes, interruption variant assessment, and somatic mosaicism quantification. STRmie-HD produces an HTML report that consolidates this information.

4
ChironRNA: Steric Clashes Resolution in RNA Structures via E(3)-Equivariant Diffusion

Li, J.; Wang, J.; Dokholyan, N. V.

2026-03-19 biophysics 10.64898/2026.03.18.712772 medRxiv
Top 0.1%
14.6%

Due to the limited resolution of experimental data, many determined RNA structures contain physically implausible geometries, such as severe steric clashes and missing atoms. Resolving these defects during RNA structure refinement remains a fundamental challenge. Structure dictates function, so the geometric accuracy of RNA structure is critical for understanding biological mechanisms. However, traditional algorithms for correction have limitations because of the complexity of RNA structures. We propose ChironRNA, an all-atom diffusion model with E(3)-equivariant graph neural networks to perform RNA refinement by resolving steric clashes and completing missing atoms. In ChironRNA, we adopt a hierarchical approach, including both an all-atom diffusion model and a coarse-grained diffusion model where each nucleotide is represented by a five-point representation. Our pipeline consists of two stages: a training stage and a generation stage. The diffusion model regenerates clashing nucleotide atoms step by step by removing the noise predicted by the E(3)-equivariant graph neural network (EGNN). ChironRNA achieves an 80% clash reduction on more than 80% of the test set. It performs better on structures of fewer than 200 nucleotides, resulting in a high percentage of cases having over 80% clash reduction rate and 100% atom reconstruction rate. Our results demonstrate that ChironRNA successfully resolves steric clashes and rebuilds missing atoms with high precision, offering a robust solution where traditional fine-tuning or enumerative approaches fail.

5
GMIP-PLSR: A Nextflow Pipeline for GWAS and Multi-Omics Integration in Gene Prioritization Using PLSR

Kanchwala, M. S.; Xing, C.; Xuan, Z.

2026-04-09 bioinformatics 10.64898/2026.04.06.716845 medRxiv
Top 0.1%
12.3%

Genome-wide association studies (GWAS) have significantly advanced our understanding of complex traits and diseases, but their interpretive power remains limited due to challenges in identifying causal genes and pathways. Integrating GWAS with multi-omics data--such as gene expression, protein-protein interactions, and gene-pathway networks--has the potential to enhance biological insights and improve gene prioritization. To fulfill this potential, we developed the GWAS & Multi-omics Integration Pipeline (GMIP), a flexible and scalable framework that incorporates widely used tools such as PoPS, MAGMA, and benchmarker to enrich GWAS findings. However, PoPS suffers from multicollinearity in its features, which can impact performance. To overcome this, we introduce GMIP-PLSR, an extension of GMIP that uses Partial Least Squares Regression (PLSR) to manage multicollinearity effectively. We applied GMIP-PLSR across multiple GWAS datasets, demonstrating superior performance over PoPS in most cases. In a case study on NAFLD, GMIP-PLSR, using features derived from both disease-specific scRNA-seq and general PoPS features, identified gene sets with higher heritability and stronger enrichment in known NAFLD pathways, confirming its ability to enhance GWAS findings. Built on Nextflow, GMIP is computationally efficient, adaptable to diverse research environments, and provides a robust solution for gene reprioritization in post-GWAS analyses. GMIP-PLSR is available at https://github.com/mohammedmsk/GMIP.

6
DualLoc: Full-parameter fine-tuning of cascaded dual transformers for protein subcellular localization prediction

Chen, Y. G.; Chung, W.-Y.; Chang, K. Y.

2026-03-30 bioinformatics 10.64898/2026.03.27.714699 medRxiv
Top 0.1%
12.2%

Accurate protein subcellular localization is essential for biological function, and mislocalization is linked to numerous diseases. While current methods like DeepLoc 2.0 employ lightweight fine-tuning of protein language models (PLMs), their ability to predict multi-compartment localization remains limited. To address this, we introduce DualLoc, a multi-label localization predictor for ten compartments. DualLoc leverages full-parameter fine-tuning of a cascaded dual-transformer architecture, built upon foundational PLMs and augmented with attention and dropout layers. We evaluated this framework using three foundational PLMs--ProtBERT, ESM-2, and ProtT5--as backbones. Cross-validation on Swiss-Prot and independent validation on the Human Protein Atlas demonstrate consistent superiority over state-of-the-art baselines. The best-performing variant, DualLoc-ProtT5, achieves 0.5872 accuracy, 0.8271 micro-F1, and 0.7811 macro-F1, with substantial gains in the Matthews correlation coefficient for the nucleus (+0.13), cell membrane (+0.13), and extracellular space (+0.07). Pointwise mutual information analysis of model outputs reveals biologically relevant compartment couplings, notably between the Golgi apparatus and endoplasmic reticulum (PMI = 0.25, P < 10^-6), accurately reflecting secretory pathway coordination. DualLoc provides both a highly accurate predictive tool and a robust framework for investigating protein multi-localization mechanisms. Author summary: Where a protein resides within a cell determines what it does. When proteins end up in the wrong location, normal cellular function breaks down--a misplacement linked to diseases like cancer and Alzheimer's. While computational tools exist to predict these locations, accurately tracking proteins that multitask across multiple cellular compartments simultaneously remains a major challenge.
We developed DualLoc, a new approach that predicts protein locations across ten different cellular compartments, from the nucleus to the cell membrane. By training an advanced artificial intelligence model on large protein sequence databases, our method more accurately identifies where proteins go, especially in complex, multi-location scenarios. Importantly, our analysis revealed meaningful biological patterns. We found strong predictive links between compartments that work closely together, such as the Golgi apparatus and the endoplasmic reticulum--two organelles that coordinate protein processing and transport. This suggests our model captures genuine cellular logic rather than simply memorizing data. By improving how we predict protein localization, DualLoc helps researchers better understand normal cellular function and disease mechanisms. Our method is freely available to the biomedical community.

7
Robust Random Forests for Genomic Prediction: Challenges and Remedies

Lourenco, V. M.; Ogutu, J. O.; Piepho, H.-P.

2026-04-01 bioinformatics 10.64898/2026.03.30.715203 medRxiv
Top 0.1%
12.1%

Data contamination--from recording errors to extreme outliers--can compromise statistical models by biasing predictions, inflating prediction errors, and, in severe cases, destabilizing performance in high-dimensional settings. Although contamination can affect responses and covariates, we focus on response contamination and evaluate Random Forests through simulation. Using a synthetic animal-breeding dataset, we assess robust Random Forests across several contamination scenarios and validate them on plant and animal datasets. We thereby clarify the consequences of contamination for prediction, develop a robust Random Forest framework, and evaluate its performance. We examine preprocessing or data-transformation strategies, algorithmic modifications, and hybrid approaches for robustifying Random Forests. Across these approaches, data transformation emerges as the most effective strategy, delivering the strongest performance under contamination. This strategy is simple, general, and transferable to other Machine Learning methods, offering a remedy for robust genomic prediction. In real breeding data, robust Random Forests are useful when substantial contamination, phenotypic corruption, misrecording, or train-deployment mismatch is plausible and the goal is to recover a latent signal for genomic prediction and selection; ranking-based robust Random Forests are the dependable first option, whereas weighting-based Random Forests should be used only when their weighting scheme preserves rank structure and improves prediction. Robustification is not universally necessary, but it becomes important when contamination distorts the link between observed responses and the predictive target; standard Random Forests remain the default for clean data, whereas robust Random Forests should be fitted alongside them whenever contamination is plausible, with the final choice guided by data, trait, and breeding objective. 
Author summary: Machine learning (ML) methods are widely used for prediction with high-dimensional, complex data, and supervised approaches such as Random Forests (RF) have proved effective for genomic prediction (GP) and selection. Yet their performance can be severely compromised by data contamination if the algorithms rely on classical data-driven procedures that are sensitive to atypical observations. Robustifying ML methods is therefore important both for improving predictive performance under contamination and for guiding their practical use in high-dimensional prediction problems. To address this need, we develop robust preprocessing, algorithm-level, and hybrid strategies for improving RF performance with contaminated data. Using simulated animal data, we show that ranking- and weighting-based robust RF provide the strongest overall compromise for genomic prediction and selection under contamination. Validation on several plant and animal breeding datasets further shows that the benefits of robustification are not universal, but depend on the dataset, trait, and breeding objective. Although motivated by RF, the framework we propose is general, practical, and readily transferable to other ML methods. It also offers a basis for deciding when robustness should complement standard RF rather than replace it outright.

8
Correlate: A Web Application for Analyzing Gene Sets and Exploring Gene Dependencies Using CRISPR Screen Data

Deolankar, S.; Wermeling, F.

2026-04-04 bioinformatics 10.64898/2026.04.02.716070 medRxiv
Top 0.1%
10.7%

CRISPR screen data provides a valuable resource for understanding gene function and identifying potential drug targets. Here, we present Correlate, a freely accessible web application (https://correlate.cmm.se) that enables exploration of the Cancer Dependency Map (DepMap) CRISPR screen gene effects, hotspot mutations, and translocation/fusion data across more than 1,000 human cancer cell lines. The application supports two main use cases: (i) analysis of user-defined gene sets (e.g. CRISPR screen hits) to identify functionally linked genes based on correlations while providing an overview based on essentiality or user-provided screen statistics; and (ii) exploration of genes of interest in defined biological contexts, such as specific cancer types or mutational backgrounds, to generate hypotheses about gene function and dependencies. Additionally, Correlate supports experimental design by providing rapid overviews of gene essentiality and enabling the identification of cell lines with relevant mutational profiles. In contrast to knowledge-based approaches such as STRING and GSEA, which rely on prior biological annotations and curated interaction networks, Correlate identifies gene connections directly from functional CRISPR screen readouts, offering a complementary and data-driven perspective on gene network analysis. The application runs entirely in the browser, requires no installation or login, and integrates with the Green Listed v2.0 tool family for custom CRISPR screen design.
Highlights:
- Interactive web-based platform for bulk correlation analysis of user-defined gene sets using DepMap CRISPR screen data, requiring no installation or programming expertise.
- Identifies functional gene relationships from CRISPR screen readouts rather than curated annotations, offering a data-driven complement to tools such as GSEA and STRING.
- Enables contextual exploration of gene dependencies across cancer types and mutational backgrounds, supporting hypothesis generation about gene function and therapeutic targets.
- Supports experimental design through gene essentiality overviews, mutation and fusion analysis, and cell line identification, with optional integration of user-provided statistics from CRISPR screens, proteomics, or transcriptomics analyses.

9
SELFormerMM: multimodal molecular representation learning via SELFIES, structure, text, and knowledge graph integration

Ulusoy, E.; Bostanci, S.; Deniz, B. E.; Dogan, T.

2026-03-19 bioinformatics 10.64898/2026.03.17.712340 medRxiv
Top 0.1%
10.4%

Motivation: Molecular representation learning is central to computational drug discovery. However, most existing models rely on single-modality inputs, such as molecular sequences or graphs, which capture only limited aspects of molecular behaviour. Yet unifying these modalities with complementary resources such as textual descriptions and biological interaction networks into a coherent multimodal framework remains non-trivial, hindering more informative and biologically grounded representations. Results: We introduce SELFormerMM, a multimodal molecular representation learning framework that integrates SELFIES notations with structural graphs, textual descriptions, and knowledge graph-derived biological interaction data. By aligning these heterogeneous views, SELFormerMM effectively captures complementary signals that unimodal approaches often overlook. Our performance evaluation has revealed that SELFormerMM outperforms structure-, sequence-, and knowledge-based models on multiple molecular property prediction tasks. Ablation analyses further indicate that effective cross-modal alignment and modality coverage improve the model's ability to exploit complementary information. Overall, integrating SELFIES with structural, textual, and biological context enables richer molecular representations and provides a promising framework for hypothesis-driven drug discovery. Availability: SELFormerMM is available as a programmatic tool, together with datasets, pretrained models, and precomputed embeddings at https://github.com/HUBioDataLab/SELFormerMM. Contact: tuncadogan@gmail.com

10
Performance of open-source large language models on nephrology self-assessment program

Ahangaran, M.; Jia, S.; Chitalia, S.; Athavale, A.; Francis, J. M.; O'Donnell, M. W.; Bavi, S. R.; Gupta, U. D.; Kolachalama, V. B.

2026-04-16 nephrology 10.64898/2026.04.16.26348910 medRxiv
Top 0.1%
10.3%

Background: Large Language Models (LLMs) have demonstrated strong performance in medical question-answering tasks, highlighting their potential for clinical decision support and medical education. However, their effectiveness in subspecialty areas such as nephrology remains underexplored. In this study, we assess the performance of open-source LLMs in answering multiple-choice questions from the Nephrology Self-Assessment Program (NephSAP) to better understand their capabilities and limitations within this specialized clinical domain. Methods: We evaluated the performance of five open-source large language models (LLMs): PodGPT, a podcast-pretrained model focused on STEMM disciplines; Llama 3.2-11B; Mistral-7B-Instruct-v0.2; Falcon3-10B-Instruct; and Gemma-2-9B-it. Each model was tested on its ability to answer multiple-choice questions derived from the NephSAP. Model performance was quantified using accuracy, defined as the proportion of correctly answered questions. In addition, the quality of the models' explanatory responses was assessed using several natural language processing (NLP) metrics: Bilingual Evaluation Understudy (BLEU), Word Error Rate (WER), cosine similarity, and Flesch-Kincaid Grade Level (FKGL). For qualitative analysis, three board-certified nephrologists reviewed 40 randomly selected model responses to identify factual and clinical reasoning errors, with performance summarized as average error ratios based on the proportion of error-associated words per response. Results: Among the evaluated models, PodGPT achieved the highest accuracy (64.77%), whereas Llama showed the lowest performance with an accuracy of 45.08%. Qualitative analysis showed that PodGPT had the lowest factual error rate (0.017), while Llama and Falcon achieved the lowest reasoning error rates (0.038).
Conclusions: This study highlights the importance of STEMM-based training to enhance the reasoning capabilities and reliability of LLMs in clinical contexts, supporting the development of more effective AI-driven decision-support tools in nephrology and other medical specialties.

11
Dingent: An Easily Deployable Database Retrieval and Integration Agent framework

Kong, D.; Bei, S.; Wu, Y.; Tang, B.; Zhao, W.

2026-03-20 bioinformatics 10.64898/2026.03.17.712026 medRxiv
Top 0.1%
10.2%

AI-driven data search and integration represent an emerging research direction. Although several LLM-based backend frameworks and agentic frameworks have emerged, a significant gap remains in developing a one-stop, configurable agent framework that supports various data sources and provides a web interface for efficient data retrieval using natural language. To address this, we present Dingent, a novel and configurable agent framework that facilitates data access from various resources and enables the flexible construction of agent applications. We demonstrate its capabilities across three distinct application scenarios, achieving promising results. The Dingent framework can be readily applied to other fields, such as earth sciences and ecology, to facilitate data discovery.

12
HalluCodon enables species-specific codon optimization using multimodal language models

Lou, Y.; Mao, S.; Wu, T.; Xia, F.; Zhang, Z.; Tian, Y.; Li, Y.; Cheng, Q.; Yan, J.; Wang, X.

2026-04-02 bioinformatics 10.64898/2026.03.31.715573 medRxiv
Top 0.1%
10.0%

Codon optimization is widely used in transgenic crop development, plant synthetic biology, and molecular farming to improve heterologous protein expression in plant cells. Increasing availability of plant omics data now enables optimization strategies that account for species-specific sequence features. We developed HalluCodon, a customizable framework that uses multimodal language models to design coding sequences tailored to individual plant species. The framework allows users to fine-tune pre-trained protein and RNA language models with their own datasets to build species-specific codon optimization models. The current implementation includes base models trained on coding sequences and proteomes from fifteen plant species. HalluCodon generates coding sequences through a hallucination-based design strategy guided by two predictive modules that evaluate coding sequence naturalness (CodonNAT) and expression potential (CodonEXP). Benchmark tests using representative proteins show that the generated sequences reproduce host-specific codon usage patterns and support high expression levels in plant systems.

13
eSIG-Net: Accurate prediction of single-mutation induced perturbations on protein interactions using a language model

Pan, X.; Shrawat, A.; Raghavan, S.; Dong, C.; Yang, Y.; Li, Z.; Zheng, W. J.; Eckhardt, S. G.; Wu, E.; Fuxman Bass, J. I.; Jarosz, D. F.; Chen, S.; McGrail, D. J.; Sheynkman, G. M.; Huang, J. H.; Sahni, N.; Yi, S. S.

2026-03-31 bioinformatics 10.64898/2026.03.27.714913 medRxiv
Top 0.1%
10.0%

Most proteins exert their functions in complex with other interactors. Single mutations can profoundly perturb protein interactions, leading to human disease. However, predicting the effect of single mutations on protein interactions remains a major computational challenge. Deep learning, particularly protein language models or transformers, has become an effective tool in bioinformatics for protein structure prediction. However, the functional divergence of mutations makes it difficult to predict their interaction perturbation profiles. To address this fundamental challenge, we present eSIG-Net (edgetic mutation Sequence-based Interaction Grammar Network), a novel sequence-based "Interaction Language Model" for predicting protein interaction alterations caused by single mutations. eSIG-Net combines various protein sequence embeddings, introduces a mutation-encoding module with syntax and evolutionary insights, and employs contrastive learning to evaluate mutation-induced interaction changes. eSIG-Net significantly outperforms current state-of-the-art sequence-based and structure-based prediction methods at predicting mutational impact on protein interactions. We highlight examples where eSIG-Net nominates causal variants with high confidence and elucidates their functional role under relevant biological contexts. Together, eSIG-Net is a first-of-its-kind "interaction language model" that can accurately predict interaction-specific rewiring by single mutations using only sequence information, and exhibits generalizability across biological contexts.

14
MHCXGraph: A Graph-Based approach to detecting T cell receptor cross-reactivity

Simoes, C. D. M. S.; Maidana, R. L. B. R.; De Assis, S. C.; Guerra, J. V. d. S.; Ribeiro-Filho, H. V.

2026-04-10 bioinformatics 10.64898/2026.04.07.717034 medRxiv
Top 0.2%
9.9%

The T cell receptor (TCR) recognition of multiple peptides presented by the major histocompatibility complex (MHC) is a key natural phenomenon, enabling the T cell repertoire to respond to a broad array of antigens. Despite its importance to the immune response, T cell cross-reactivity poses a major challenge for the development of novel T cell-based therapies. In this study, we present MHCXGraph, a graph-based computational approach for identifying conserved and immunologically relevant regions across multiple structures of peptides bound to MHC molecules (pMHC). Our approach provides three operational modes with user-defined parameters, allowing flexible configuration according to specific scientific needs while delivering fully interpretable results through user-friendly interfaces. We evaluated MHCXGraph across three case studies, including peptides bound to classical MHC Class I, MHC Class II, and unbound HLA alleles, demonstrating its ability to capture conserved structural determinants beyond sequence similarity. By integrating structural information with efficient graph-based analysis, MHCXGraph addresses key limitations of sequence-based methods while maintaining computational scalability. Collectively, these results indicate that MHCXGraph can be readily integrated into computational pipelines for T cell cross-reactivity discovery, especially in the context of de novo pMHC engager design and T cell-based vaccine development.

15
ATHILAfinder: a tool to detect ATHILA LTR retrotransposons in plant genomes

Bousios, A.; Primetis, E.

2026-03-22 bioinformatics 10.64898/2026.03.20.713144 medRxiv
Top 0.2%
9.2%

Motivation: The ATHILA lineage of LTR retrotransposons has colonised all branches of the plant tree of life. In Arabidopsis thaliana and A. lyrata, ATHILA elements have invaded centromeres, influencing the genetic and epigenetic organisation, and driving satellite evolution. To assess the broader significance of ATHILA across plants, a computational pipeline is needed to identify ATHILA elements with high efficiency. Existing tools lack this ability because they are optimised for broad transposon classification at the expense of precise annotation of lower taxonomic levels. Results: We present ATHILAfinder, a pipeline for accurate and large-scale discovery of ATHILA elements. ATHILAfinder uses lineage-specific sequence motifs as seeds and additional filters to build de novo intact elements. Homology-based steps rescue intact ATHILA and identify soloLTRs. A detailed identity card includes coordinates, LTR identity, coding capacity, length and other sequence features for every ATHILA. We validate ATHILAfinder in the A. thaliana Col-CEN assembly and five additional Brassicaceae species, covering four supertribes and ~30 million years of evolution. ATHILAfinder has very low false positive rates and outperforms widely used tools like EDTA and the deep-learning-based Inpactor2 software for both recovery and precision of ATHILA. To demonstrate its usefulness, we generate insights into ATHILA dynamics across Brassicaceae. Outlook: Few computational pipelines target specific transposon lineages, yet such tools can empower their identification and downstream analyses. Our tailored approach can be adapted to other LTR retrotransposon lineages, offering new ways for high-resolution analysis of transposons.

16
GlycoDiveR: a modular R framework to analyze and visualize highly dimensional glycoproteomics data

Veth, T. S.; Riley, N. M.

2026-03-24 systems biology 10.64898/2026.03.21.713336 medRxiv
Top 0.2%
9.1%

Mass spectrometry-based glycoproteomics is a critical platform for understanding the complex roles of protein glycosylation in biological systems, yet visualizing multidimensional glycoproteomics datasets remains a significant bottleneck in data interpretation and communication. Glycan microheterogeneity, i.e., the potential for a glycosite to be modified by multiple glycans, defies the binary presence-absence logic used in analyses of other post-translational modifications. Instead, glycoproteomics necessitates intentionally designed data structures and visualizations that are glycoform-centric, not just site-centric. Additionally, there is a need for complementary degrees of data analysis that alternate between glycoproteome-scale patterns and glycosite-specific regulation. Several bespoke frameworks for visualizing glycoproteomics data have emerged, but they often require advanced programming expertise and are designed for a single study rather than broad application. Here, we present our efforts to harmonize post-search data analysis of glycoproteomics through a modular R framework called GlycoDiveR. This platform streamlines import, transformation, and curation of qualitative and quantitative glycopeptide identifications, including support for raw output from multiple search engines. GlycoDiveR is designed to integrate seamlessly into existing analysis workflows by enabling fast, flexible exploration of highly dimensional glycoproteomics datasets via a consistently formatted data architecture. Our goal is to offer a customizable set of glycosylation-specific visualizations with minimal coding, while keeping data accessible to users who wish to further customize their characterization strategies. It also maintains a modular design that supports the continual addition of visualizations, analyses, and export functions. 
Ultimately, GlycoDiveR is meant to improve accessibility of glycoproteomic-specific analyses and lower the barrier to exploring biological narratives embedded in rich glycoproteomic datasets. GlycoDiveR is open-source and freely available at https://github.com/riley-research/GlycoDiveR.
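GlycoDiveR itself is an R package, but the glycoform-centric data model the abstract contrasts with site-centric analyses can be illustrated with a minimal Python sketch. All field names, accessions, glycan compositions, and intensities below are hypothetical examples, not GlycoDiveR's actual schema: the point is that each glycosite maps to a *set* of glycoforms rather than a binary modified/unmodified flag.

```python
from collections import defaultdict

# Hypothetical glycopeptide records: (protein, glycosite, glycan, intensity).
# Values are illustrative only, not drawn from GlycoDiveR or any real dataset.
records = [
    ("P01857", "N297", "HexNAc(4)Hex(3)Fuc(1)", 1.8e6),
    ("P01857", "N297", "HexNAc(4)Hex(4)Fuc(1)", 9.5e5),
    ("P01857", "N297", "HexNAc(4)Hex(5)Fuc(1)", 3.1e5),
    ("P02765", "N176", "HexNAc(2)Hex(9)",       2.2e6),
]

def microheterogeneity(records):
    """Map each glycosite to its set of observed glycans (glycoform-centric view)."""
    site_glycans = defaultdict(set)
    for protein, site, glycan, _ in records:
        site_glycans[(protein, site)].add(glycan)
    return site_glycans

def relative_abundance(records, protein, site):
    """Normalize glycoform intensities within a single glycosite."""
    hits = [(g, i) for p, s, g, i in records if (p, s) == (protein, site)]
    total = sum(i for _, i in hits)
    return {g: i / total for g, i in hits}

sites = microheterogeneity(records)
fractions = relative_abundance(records, "P01857", "N297")
```

A site-centric analysis would collapse the three N297 rows into one "modified" flag; keeping the per-glycoform intensities is what makes microheterogeneity visible.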

17
Evaluating FoldX5.1 for MAVISp Stability Data Collection

Vliora, A.; Tiberti, M.; Papaleo, E.

2026-04-02 bioinformatics 10.64898/2026.03.31.715598 medRxiv
Top 0.2%
8.4%

MAVISp (Multi-layered Assessment of VarIants by Structure for proteins) is a structure-based framework for facilitating mechanistic interpretation of missense variants, with protein stability as one of its core analytical layers. When software tools are updated, a key consideration for database curation is whether the new version can be adopted without compromising compatibility with existing entries. This study evaluated the effect of replacing FoldX5 with FoldX5.1 on the results of the MAVISp stability workflow. We compared predicted changes in folding free energy for 539,809 shared variants across 119 proteins. We found high overall agreement with a mean Pearson correlation of 0.933 and a mean Cohen's kappa coefficient of 0.814. Most proteins showed strong concordance, whereas only three (NUPR1, TSC1, and TMEM127) showed poor agreement. The number of disagreements was higher at sites with low AlphaFold2 confidence for NUPR1 and TSC1. These outliers did not display systematic inter-version bias, as mean shifts in folding free energies between versions were minimal. Collectively, these findings support adopting FoldX5.1 for future MAVISp data collection. We will include a transition period, during which existing entries retain FoldX5 annotations until their scheduled annual update, while new or updated entries are processed with FoldX5.1. To facilitate this transition, the FoldX software version has been added as a new metadata annotation in the MAVISp database.
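The two agreement metrics the abstract reports (Pearson correlation on continuous free-energy changes, Cohen's kappa on the resulting stability classes) can be sketched in a few lines. The ±3 kcal/mol cutoff and the toy ΔΔG values below are assumptions for illustration only, not MAVISp's actual classification thresholds or data:

```python
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def classify(ddg, cutoff=3.0):
    """Ternary stability class; the cutoff is an illustrative assumption."""
    if ddg > cutoff:
        return "destabilizing"
    if ddg < -cutoff:
        return "stabilizing"
    return "neutral"

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    labels = sorted(set(a) | set(b))
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    pe = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    return (po - pe) / (1 - pe)

# Toy per-variant ddG values from two hypothetical FoldX versions.
v5  = [0.2, 4.1, -0.5, 5.0]
v51 = [0.3, 4.0, -0.4, 5.2]
r = pearson(v5, v51)
kappa = cohens_kappa([classify(d) for d in v5], [classify(d) for d in v51])
```

Reporting both metrics is the sensible design here: Pearson r can stay high even when values drift across a classification boundary, so kappa on the discretized classes catches disagreements that matter for database curation.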

18
Breaking the Extraction Bottleneck: A Single AI Agent Achieves Statistical Equivalence with Human-Extracted Meta-Analysis Data Across Five Agricultural Datasets

Halpern, M.

2026-03-23 bioinformatics 10.64898/2026.02.17.706322 medRxiv
Top 0.2%
8.4%

Background: Data extraction is the primary bottleneck in meta-analysis, consuming weeks of researcher time with single-extractor error rates of 17.7%. Existing LLM-based systems achieve only 26-36% accuracy on continuous outcomes, and no study has validated AI-extracted continuous data against multiple independent datasets using formal equivalence testing. Methods: A single AI agent (Claude Opus 4.6) extracted treatment means, control means, sample sizes, and variance measures from source PDFs across five published agricultural meta-analyses spanning zinc biofortification, biostimulant efficacy, biochar amendments, predator biocontrol, and elevated CO2 effects on plant mineral nutrition. Observations were matched to reference standards using an LLM-driven alignment method. Validation employed proportional TOST equivalence testing, ICC(3,1), Bland-Altman analysis, and source-type stratification. Results: Across five datasets, the agent produced 1,149 matched observations from 136 papers. Pearson correlations ranged from 0.984 to 0.999. Proportional TOST confirmed statistical equivalence for all five datasets (all p < 0.05). Table-sourced observations achieved 5.5x lower median error than figure-sourced observations. Aggregate effects were reproduced within 0.01-1.61 pp of published values. Independent duplicate runs confirmed extraction stability (within 0.09-0.23 pp). Conclusions: A single AI agent achieves statistical equivalence with human-extracted meta-analysis data across five independent agricultural datasets. The approach reduces extraction cost by approximately one to two orders of magnitude while maintaining accuracy sufficient for aggregate meta-analytic pooling.
Highlights

What is already known
- Data extraction is the primary bottleneck in meta-analysis, with single-extractor error rates of 17.7%
- Existing LLM-based extraction systems achieve only 26-36% accuracy on continuous outcomes
- No study has validated AI extraction against multiple independent datasets using formal equivalence testing

What is new
- A single AI agent achieves statistical equivalence with human-extracted data across five agricultural meta-analyses (1,149 observations, 136 papers)
- LLM-driven alignment resolves the previously underappreciated bottleneck of moderator matching, improving correlations from 0.377-0.812 to 0.984-0.997 without changing extracted values
- Table-sourced observations achieve 5.5x lower error than figure-sourced data

Potential impact for RSM readers
- Provides a validated, reproducible workflow for AI-assisted data extraction in meta-analysis
- Demonstrates that most apparent "extraction error" in validation studies is actually alignment error
- Offers practical quality signals (source-type labeling) for downstream meta-analysts
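The core of the validation is equivalence testing: instead of failing to reject "no difference", TOST asks whether the AI-human difference is provably *inside* a tolerated margin. A minimal sketch of a proportional TOST on paired extractions follows; it uses a large-sample normal approximation from the standard library, and the 5% margin fraction, the alpha level, and all data values are assumptions for illustration; the paper's exact procedure (degrees of freedom, margin definition) may differ:

```python
from statistics import NormalDist, fmean, stdev

def proportional_tost(ai, human, margin_frac=0.05, alpha=0.05):
    """Two one-sided tests on paired differences: is the mean AI-human
    difference within +/- margin_frac of the human-extracted mean?
    Normal approximation (large n); illustrative, not the paper's exact test."""
    diffs = [a - h for a, h in zip(ai, human)]
    n = len(diffs)
    margin = margin_frac * abs(fmean(human))
    se = stdev(diffs) / n ** 0.5
    nd = NormalDist()
    p_lower = 1 - nd.cdf((fmean(diffs) + margin) / se)  # H0: diff <= -margin
    p_upper = nd.cdf((fmean(diffs) - margin) / se)      # H0: diff >= +margin
    p = max(p_lower, p_upper)  # equivalence requires rejecting BOTH one-sided H0s
    return p, p < alpha

# Toy paired data: AI values differ from "human" values by alternating +/-0.05.
human = [10.0, 12.0, 9.5, 11.0, 10.5, 13.0, 9.0, 12.5] * 5
ai = [h + (0.05 if i % 2 == 0 else -0.05) for i, h in enumerate(human)]
p_value, equivalent = proportional_tost(ai, human)
```

Taking the larger of the two one-sided p-values is what makes TOST conservative: equivalence is declared only when the difference is significantly above the lower bound *and* significantly below the upper bound.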

19
PhagePickr: A bacteria-centric computational tool for designing evolution-proof phage cocktails

Oneto, A.; Okamoto, K. W.

2026-03-23 microbiology 10.64898/2026.03.23.713575 medRxiv
Top 0.3%
8.2%

As antibiotic resistance poses a major threat to global health, phage therapy offers an alternative to antibiotic treatments in the face of multidrug-resistant bacteria. However, host resistance to phages is also well-documented. Current computational tools for phage cocktail design do not explicitly address the evolution of phage resistance, let alone through the profiling of bacterial receptors whose variability drives much of phage resistance. We introduce PhagePickr, a computational pipeline for the automated design of phage cocktails that minimize host resistance. Unlike other tools, PhagePickr selects phages based on bacterial surface receptor similarity and prioritizes phage diversity to prevent cross-resistance. The tool uses NCBI datasets, a Nearest Neighbors algorithm, and Multiple Sequence Alignment to identify phenotypically similar hosts and ensure phylogenetic diversity in the final cocktail. We evaluated the utility of PhagePickr on ESKAPE pathogens and two understudied bacterial species. The cocktails included candidate phages predicted to target diverse receptors, comprising both lytic phages with confirmed therapeutic potential and novel candidates from similar species. We demonstrate the tool's utility in generating cocktails and its capacity to scale as current databases are updated. PhagePickr provides a novel bacteria-centric framework for designing resistance-proof cocktails by exploring shared phenotypes. Author Summary: We present PhagePickr, a novel computational tool to design bacteriophage cocktails against pathogenic bacteria. Antibiotic resistance poses a major threat to global health, and phage therapy, the use of viruses that kill bacteria, is a promising alternative treatment. However, bacteria are also under immense selective pressure to develop resistance to phages, and existing tools for automated cocktail design have yet to address this challenge.
Our tool is designed to circumvent resistance through two steps: it uses bacterial receptor configurations as predictors of phage infection and constructs a cocktail that maximizes phage diversity to target multiple receptors, making it harder for bacteria to evolve simultaneous resistance. We demonstrated the utility of PhagePickr on two understudied species and the ESKAPE pathogens, a group of multidrug-resistant bacteria responsible for the majority of deaths associated with antibiotic resistance worldwide. The tool identified both well-characterized therapeutic phages and novel candidates, and is designed to scale as databases expand. Our approach represents a key step toward the rational design of evolution-proof phage cocktails for clinical use.
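The diversity-maximization step described above, picking phages so the cocktail covers as many distinct receptors as possible, has the shape of a classic greedy max-coverage problem. The sketch below illustrates that design principle only; the phage names, receptor labels, and the greedy rule are hypothetical and are not PhagePickr's actual algorithm or data:

```python
def design_cocktail(candidates, k=3):
    """Greedy max-coverage: at each step add the phage that targets the most
    receptors not yet covered by the cocktail. Illustrative sketch, not
    PhagePickr's exact selection procedure."""
    covered, cocktail = set(), []
    pool = dict(candidates)  # phage name -> set of targeted receptors
    for _ in range(min(k, len(pool))):
        best = max(pool, key=lambda p: len(pool[p] - covered))
        if not pool[best] - covered:
            break  # no remaining phage adds a new receptor
        cocktail.append(best)
        covered |= pool.pop(best)
    return cocktail, covered

# Hypothetical candidates and receptor profiles.
phages = {
    "phiA": {"LPS", "OmpC"},
    "phiB": {"LPS"},
    "phiC": {"pilus", "OmpF"},
    "phiD": {"OmpC", "OmpF"},
}
cocktail, receptors = design_cocktail(phages, k=2)
```

Covering many distinct receptors is what raises the bar for resistance: a host would need simultaneous mutations in several independent surface structures to escape the whole cocktail.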

20
Synolog: A Scalable Synteny-Based Framework for Genome Architecture Characterization

Madrigal, G.; Catchen, J. M.

2026-04-10 bioinformatics 10.64898/2026.04.07.717040 medRxiv
Top 0.3%
8.2%

Characterizing genome architecture across multiple organisms has been an ongoing task for decades. The continuing growth of genomic datasets not only serves as a resource for studying genome evolution but also calls for scalable and user-friendly software for processing these datasets. Here, we present Synolog, a bioinformatic toolkit that can automatically identify orthologs for both protein-coding and non-coding genes, synteny clusters across two or more genomes, retrogenes, and segmental duplications. Applying Synolog, we illustrate cases of local gene expansions in ecologically disparate turtle species, identify synteny clusters across hundreds of millions of years of metazoan evolution, and reconstruct chromosome-level assemblies in teleosts using the inferred synteny clusters; all using its integrated visual features. In parallel, we compare our orthogroup method to that of commonly used software and note the tradeoffs of making inferences solely based on sequence similarity versus a synteny-based approach.
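The synteny-based idea Synolog builds on, grouping genes by conserved order rather than sequence similarity alone, can be reduced to a toy form: find runs of orthologs that appear consecutively and in the same order in two genomes. This is a deliberately simplified stand-in for Synolog's clustering (gene names and genome orders below are hypothetical, and real synteny detection must also handle inversions, gaps, and many-to-many orthologs):

```python
def synteny_blocks(genome_a, genome_b, min_len=2):
    """Return maximal runs of shared genes that are consecutive and
    collinear in both genome_a and genome_b. Simplified illustration only."""
    pos_b = {g: i for i, g in enumerate(genome_b)}  # gene -> position in B
    blocks, run = [], []
    for g in genome_a:
        if g in pos_b and run and pos_b[g] == pos_b[run[-1]] + 1:
            run.append(g)  # extends the current collinear run
        else:
            if len(run) >= min_len:
                blocks.append(run)
            run = [g] if g in pos_b else []
    if len(run) >= min_len:
        blocks.append(run)
    return blocks

# Hypothetical gene orders: an insertion of "x" in genome A splits one block.
genome_a = ["a", "b", "c", "x", "d", "e"]
genome_b = ["a", "b", "c", "d", "e", "y"]
blocks = synteny_blocks(genome_a, genome_b)
```

Even this toy version shows why synteny complements sequence similarity: genes "x" and "y" are excluded not because their sequences diverge but because their genomic context does.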